Abstract

Recent years have seen significant interest in 
designing networks that are self-healing in the 
sense that they can automatically recover from 
adversarial attack. Previous work shows that 
it is possible for a network to automatically re- 
cover, even when an adversary repeatedly deletes 
nodes in the network. However, there have not 
yet been any algorithms that self-heal in the case 
where an adversary takes over nodes in a net- 
work. In this paper, we address this gap. 

In particular, we describe a communication 
network over n nodes that ensures the following 
properties, even when an adversary controls up 
to $t < (1/4 - \epsilon)n$ nodes, for any positive $\epsilon$. First,
the network provides point-to-point communica- 
tion with bandwidth and latency costs that are 
asymptotically optimal. Second, $O(t(\log^* n)^2)$
message corruptions occur in expectation, before 
the adversarially controlled nodes are effectively 
quarantined so that they cause no more corrup- 
tions. We present empirical results showing that 
our approach may be practical. 

"Fool me once, shame on you. Fool me twice, 
shame on me." - English proverb

1 Introduction 

Self-healing algorithms protect critical proper- 
ties of a network, even when that network is 
under repeated attack. Such algorithms only 
expend resources when it is necessary to repair 
damage done by an attacker. Thus, they pro- 
vide significant resource savings when compared 
to traditional robust algorithms, which expend 
significant resources even when the network is 
not under attack. 

The last several years have seen exciting results
in the design of self-healing algorithms J2J
El E21 HS|. Unfortunately, none of these 
previous results handle Byzantine faults, where 
an adversary takes over nodes in the network and 
can cause them to deviate arbitrarily from the 
protocol. This is a significant gap, since tradi- 
tional Byzantine-resilient algorithms are notori- 
ously inefficient, and the self-healing approach 
could significantly improve efficiency. 

In this paper, we take a step towards address- 
ing this gap. For a network of n nodes, we de- 
sign self-healing algorithms for communication 
that tolerate up to a 1/4 fraction of Byzantine 
faults. Our algorithms enable any node to send 
a message to any other node in the network with 
bandwidth and latency costs that are asymptot- 
ically optimal. 

Moreover, our algorithms limit the expected
number of message corruptions. Ideally, each
Byzantine node would cause $O(1)$ corruptions;
our result is that each Byzantine node causes
an expected $O((\log^* n)^2)$ corruptions.¹ Thus, we
must amend our initial proverb to: "Fool me
once, shame on you. Fool me $\omega((\log^* n)^2)$ times,
shame on me."

1.1 Our Model 

We assume an adversary that is static in the 
sense that it takes over nodes before the algo- 
rithm begins. We call the nodes controlled by 
the adversary bad and the remaining nodes good. 
The bad nodes may arbitrarily deviate from the 
protocol, by sending no messages, excessive numbers
of messages, incorrect messages, or any combination
of these. The good nodes follow the protocol. We
assume that the adversary knows our protocol, but is
unaware of the random bits of the good nodes.

¹ Recall that $\log^* n$, the iterated logarithm function,
is the number of times the logarithm must be applied
iteratively before the result is less than or equal to 1. It is
an extremely slowly growing function: e.g., $\log^* 10^{10} = 5$.
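To make the iterated logarithm concrete, the following minimal Python sketch counts how many times the base-2 logarithm must be applied before the value drops to 1 or below. The function name `log_star` is our own, and base 2 is assumed, as in the analysis section.

```python
import math

def log_star(n):
    """Number of times log2 must be applied to n before the result is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

print(log_star(10**10))  # 5, matching the example in the footnote
print(log_star(2**16))   # 4: 65536 -> 16 -> 4 -> 2 -> 1
```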

We further assume that each node has a unique ID. We
say that node p has a link to node q if p knows q's ID
and can thus directly communicate with node q. Also, we
assume the existence of a public key digital signature
scheme, and thus a computationally bounded adversary.
Finally, we assume a partially synchronous communication
model: any message sent from one good node to another
good node requires at most $\Delta$ time steps to be sent and
received, and the value $\Delta$ is known to all nodes. However,
we assume a rushing adversary, so the bad nodes receive
all messages from good nodes in a round before sending
out their own messages.

Our algorithms make critical use of quorums 
and a quorum graph. We define a quorum to 
be a set of $\Theta(\log n)$ nodes, of which at most
a 1/4 fraction are bad. Many results show 
how to create and maintain a network of quo- 
rums |51 El El El El E EH- All of these results 
maintain what we will call a quorum graph in 
which each vertex represents a quorum. The 
properties of the quorum graph are: 1) Each
node is in $O(\log n)$ quorums; 2) For any quorum
$Q$, any node in $Q$ can communicate directly with
any other node in $Q$; and 3) For any quorums $Q_i$
and $Q_j$ that are connected in the quorum graph,
any node in $Q_i$ can communicate directly with
any node in $Q_j$ and vice versa.

Communication in the quorum graph typically occurs as
follows. When a node s sends another node r some message
m, there is a canonical quorum path through the quorum
graph, $Q_1, Q_2, \ldots, Q_\ell$, where $s \in Q_1$ and $r \in Q_\ell$.
This path is determined by the IDs of both s and r. A naive
way to route the message is for s to send m to all nodes in
$Q_1$. Then, for $i = 1$ to $\ell - 1$, all nodes in $Q_i$ send m to
all nodes in $Q_{i+1}$, and all nodes in $Q_{i+1}$ do majority
filtering on the messages received in order to determine the
true value of m. Unfortunately, this algorithm requires
$O(\ell \log^2 n)$ messages. This paper shows how to reduce
this cost.
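The naive scheme can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions of our own: the sender is good, every bad node always forwards the string "corrupt", and each quorum is given as a list of booleans (True for a good node). The names `majority` and `naive_route` are hypothetical.

```python
from collections import Counter

def majority(values):
    """Return the most common value among the received copies."""
    return Counter(values).most_common(1)[0][0]

def naive_route(message, path):
    """Forward `message` along `path`, a list of quorums.

    Each quorum is a list of booleans: True marks a good node, False a bad
    node. Bad nodes always forward a corrupted value; good nodes forward the
    majority of what they received. Returns (majority value held in the last
    quorum, total number of messages sent).
    """
    sent = len(path[0])                       # s sends m to every node of Q_1
    held = [message if good else "corrupt" for good in path[0]]
    for prev, nxt in zip(path, path[1:]):
        sent += len(prev) * len(nxt)          # all-to-all between neighbours
        filtered = majority(held)             # what each good receiver decides
        held = [filtered if good else "corrupt" for good in nxt]
    return majority(held), sent

# Five quorums of 8 nodes each, one quarter of them bad.
path = [[True] * 6 + [False] * 2 for _ in range(5)]
print(naive_route("m", path))  # ('m', 264): 8 + 4 * 8 * 8 messages
```

With quorums of size $\Theta(\log n)$, the all-to-all term contributes $\Theta(\log^2 n)$ messages per hop, which is the source of the $O(\ell \log^2 n)$ cost.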



1.2 Our Results 

This paper provides a self-healing algorithm, 
SEND, that sends a message from a source node 
to a target node in the network. Our main result 
is summarized in the following theorem. 
Theorem 1.1. Assume we have a network with
$n$ nodes and $t < (1/4 - \epsilon)n$ bad nodes, for any
positive $\epsilon$, and a quorum graph as described
above. Then our algorithm ensures the following.

• For any call to SEND, the expected latency
is $O(\ell)$ and the expected number of messages
is $O(\ell + \log n)$, in an amortized sense.²

• The total number of times that a message
can be corrupted in a call to SEND
is $O(t(\log^* n)^2)$ in expectation.

1.3 Related Work 

Our results are inspired by recent work on self-healing
algorithms [2, 7, 8, 12, 14]. A common model for these
results is that the following process repeats indefinitely:
an adversary
deletes some nodes in the network, and the algo- 
rithm adds edges. The algorithm is constrained 
to never increase the degree of any node by more 
than a logarithmic factor from its original de- 
gree. In this model, researchers have presented 
algorithms that ensure the following properties: 
the network stays connected and the diameter 
does not increase by much [2, 14, 7]; the shortest
path between any pair of nodes does not increase
by much [8]; and expansion properties of the network
are approximately preserved [12].

Our results are also similar in spirit to those of
Saia and Young [15] and Young et al. [19], which
both show how to reduce message complexity
when transmitting a message across a quorum
path of length $\ell$. The first result, [15], achieves
expected message complexity of $O(\ell \log n)$ by
use of bipartite expanders. However, this result
is impractical due to high hidden constants
and high setup costs. The second result, [19],
achieves expected message complexity of $O(\ell)$.
However, this second result requires the sender
to iteratively contact a member of each quorum
in the quorum path. Thus, while practical
for some peer-to-peer applications, it has drawbacks
in 1) load-balancing: for example, a single
node broadcasting to all nodes through a tree
of quorums must send $\Theta(n)$ messages; and 2)
anonymity: the ID of the sender is learned by at
least one node in each quorum.

² In particular, if we perform any number of message
sends through quorum paths, where $\ell_M$ is the longest such
path, and $\mathcal{L}$ is the sum of the quorums traversed in all
such paths, then the expected total number of messages
sent will be $O(\mathcal{L} + t \cdot \ell_M \log^2 n \log^* n)$. Note that, since $t$
is fixed, for large $\mathcal{L}$ this value is $O(\mathcal{L})$.

As mentioned earlier, several peer-to-peer net- 
works have been described that provably enable 
reliable communication, even in the face of ad- 
versarial attack O El El EH El H] . To the best of 
our knowledge, our approach applies to each of 
these networks, with the exception of [3J. In par- 
ticular, we can apply our algorithms to asymp- 
totically improve the efficiency of the peer-to- 
peer networks from El Ell E3 E] ■ 

1.4 Organization of Paper 

The rest of this paper is organized as follows.
In Section 2, we describe our algorithms. In
Section 3, we prove the correctness of these
algorithms; the main result of this section is a
proof of Theorem 1.1. We give empirical results
showing how our algorithms can improve the
efficiency of the butterfly network of [3] in Section 4.
Finally, we conclude and describe problems for
future work in Section 5.

2 Our Algorithms 

Algorithm 1 SEND(m, r)

Assumptions: Node s wants to send message
m to node r.

1. Node s calls SEND-LEADER(m, r)

2. With probability $1/(\log^* n)^2$, node s calls
CHECK(m, r)

In this section, we describe our algorithms
SEND, SEND-LEADER, CHECK, and UPDATE.
The main technical challenge of our paper
is in the design of the algorithm CHECK,
which is described in Section 2.2.



Algorithm 2 SEND-LEADER(m, r)

Assumptions: m is the message to be sent, and
r is the destination. We let $Q_1, Q_2, \ldots, Q_\ell$ be the
quorum path from s to r in the quorum graph.

1. Node s sends m to every node in $Q_1$

2. Each node in $Q_1$ sends the message it receives
to the leader $q_2$ of quorum $Q_2$.

3. The leader $q_2$ checks for conflicting messages.
If messages conflict, $q_2$ aborts and
initiates a call to UPDATE.
Otherwise, for $i = 2, \ldots, \ell - 2$ do

(a) The leader $q_i$ of quorum $Q_i$ sends the
message it receives to the leader $q_{i+1}$
of quorum $Q_{i+1}$

4. The leader $q_{\ell-1}$ sends the message it receives
to all nodes in $Q_\ell$

5. The node r checks for conflicting messages.
If messages conflict, r initiates a call to
UPDATE.
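The message flow of SEND-LEADER can be illustrated with a small Python sketch that counts messages along the path. It rests on assumptions of our own: the path has at least three quorums, a bad leader is modelled as always replacing the message (only one of many possible Byzantine behaviours), and the conflict checks and the call to UPDATE are omitted. The function name and parameters are hypothetical.

```python
def send_leader(message, quorum_sizes, leader_is_good):
    """Sketch of SEND-LEADER over a quorum path Q_1..Q_l (l >= 3).

    quorum_sizes[i] is |Q_{i+1}|; leader_is_good[i] says whether the leader
    of Q_{i+1} is good (the entries for Q_1 and Q_l are unused).
    Returns (value delivered, number of messages sent).
    """
    l = len(quorum_sizes)
    sent = quorum_sizes[0]            # step 1: s -> every node in Q_1
    sent += quorum_sizes[0]           # step 2: every node in Q_1 -> leader q_2
    value = message
    for i in range(1, l - 1):         # leaders q_2 .. q_{l-1} (0-based 1..l-2)
        if not leader_is_good[i]:
            value = "corrupt"         # a bad leader replaces the message
        if i < l - 2:
            sent += 1                 # step 3(a): q_i -> q_{i+1}
    sent += quorum_sizes[-1]          # step 4: q_{l-1} -> every node in Q_l
    sent += quorum_sizes[-1]          # nodes in Q_l -> r, as in the overview
    return value, sent

print(send_leader("m", [8] * 5, [True] * 5))                       # ('m', 34)
print(send_leader("m", [8] * 5, [True, True, False, True, True]))  # ('corrupt', 34)
```

The count is linear in the path length plus a few quorum-sized terms, in line with the $O(\ell + \log n)$ message cost stated for SEND-LEADER.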



2.1 Overview 

Our algorithms maintain a leader for each quorum.
We maintain the invariant that, for every
quorum $Q$, all nodes in $Q$ know the leader of $Q$.
Additionally we maintain that for every quorum
$Q'$, such that $Q'$ has an edge to $Q$ in the quorum
graph, all nodes in $Q'$ know the leader of $Q$.

As described previously, we assume that when
node s wants to send a message to a node r, there
is a canonical quorum path $Q_1, Q_2, \ldots, Q_\ell$,
determined by the IDs of s and r, such that $s \in Q_1$
and $r \in Q_\ell$. Our main algorithm SEND
(Algorithm 1) has s call SEND-LEADER with the
message to be sent and the ID of the node r.

In SEND-LEADER (Algorithm 2), s sends
the message to all nodes in $Q_1$; these nodes send
to the leader of $Q_2$; and the message is propagated
by quorum leaders until reaching $Q_{\ell-1}$.
Then the leader of $Q_{\ell-1}$ sends to all nodes in
$Q_\ell$, and these nodes send directly to r.

SEND-LEADER is vulnerable to corruption.
Thus, with probability $1/(\log^* n)^2$, SEND next
calls CHECK (Algorithm 3), which has the following
two properties: 1) with probability at
least 1/2, it determines if a message was corrupted
in the previous call to SEND-LEADER;
and 2) it is resource efficient, requiring only
$O(\ell(\log^* n)^2)$ messages.

Unfortunately, while CHECK can determine if
a corruption occurred, it does not determine the
location where the corruption occurred. Thus, if
CHECK detects a corruption, UPDATE
(Algorithm 4) is called. When called after a corruption
occurs, UPDATE identifies two neighboring
quorums $Q_i$ and $Q_{i+1}$ in the path, for some
$1 \le i < \ell$, such that one of the leaders of the two
quorums is bad. Then, new leaders are elected
for both $Q_i$ and $Q_{i+1}$.

2.2 CHECK 

The algorithm CHECK is described formally as
Algorithm 3. We now give an overview. CHECK
runs for $4 \log^* n$ rounds, and maintains a subset
$S_i$ for each quorum $Q_i$ in the quorum path.
Initially all $S_i$ are empty and s generates a
public/private key pair.

In each round, for each quorum $Q_i$, $1 \le i \le \ell$,
a new node is added to $S_i$. This new node is
chosen uniformly from all nodes in $Q_i$. Also, in
each round, a message $m'$ is constructed, which
consists of 4 items: 1) the original message m
sent during SEND-LEADER; 2) the public key
$k_p$ generated by s at the start of the call to
CHECK; 3) the ID of the receiver node, r; and
4) an array of random numbers, R, that are used
to pick the nodes added to all the $S_i$ sets in the
current round. The message $m'$ is signed using
the private key $k_s$.

The node s sends $m'$ to all nodes in $S_1$; then
for all $1 \le i < \ell$, all nodes in $S_i$ send to the
new node in $S_{i+1}$ and this new node sends to all
nodes in $S_{i+1}$. Finally, all nodes in $S_\ell$ send to
the node r.

If a node has previously received the public
key, $k_p$, then it verifies each subsequent message
with it. A node initiates a call to UPDATE if it
receives inconsistent messages or fails to receive
and verify some expected message.

An example run of CHECK is illustrated in
Figure 1. In this figure, there is a column for
each quorum in the quorum path and a row for
each round of CHECK. For a given row and column,
there is a G or B in that position depending
on whether the node selected in that particular
round and that particular quorum is good (G) or
bad (B). The left bar in each row specifies the
rightmost quorum in which there is some good
node that knows $m'$. The right bar in each row
specifies the leftmost quorum in which there is
some good node that does not know $m'$.

gbg|bbbbbbbbbbbgbgb
bgggbgbg|bbb|gbbbggg
gbgggbbbbg|b|bggbbgb
bgbbbgggbgggbbggbg

Figure 1: Example run of CHECK

Note that, as rounds progress, the left bar can
only move rightwards, because a node that has
already received $k_p$ will call UPDATE unless it
receives messages signed with $k_s$ for all subsequent
rounds. Further, note that the right bar
can only move leftwards, since there is all-to-all
communication between the nodes in the sets $S_i$.
Finally, note that when these two bars meet, a
corruption is detected.

Intuitively, the reason CHECK requires only
$4 \log^* n$ rounds is because of a probabilistic result
on the maximum length run in a sequence
of coin tosses. In particular, if we have a coin
that takes on value "B" with probability 1/4, and
value "G" with probability 3/4, and we toss it $x$
times, then the expected length of the longest
run of B's is $\log x$. Thus, if in some round, the
distance between the left bar and the right bar
is $x$, we expect in the next round this distance
will shrink to $\log x$. Intuitively, we might expect
that, if the quorum path is of length $\ell$, then
$O(\log^* \ell)$ rounds will suffice before the distance
shrinks to 0. This intuition is formalized in
Lemmas 1 and 2 of Section 3.
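The shrinking argument can be explored with a short simulation. The Python sketch below is a deliberately simplified model of our own, not the CHECK algorithm itself: in each round the gap between the two bars is replaced by the longest run of bad picks among as many independent picks as the current gap, each bad with probability 1/4. The function names are hypothetical.

```python
import random

def longest_bad_run(x, p_bad=0.25, rng=random):
    """Length of the longest run of bad picks among x independent picks."""
    best = current = 0
    for _ in range(x):
        if rng.random() < p_bad:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best

def rounds_until_closed(path_length, rng=random):
    """Repeatedly shrink the gap to the longest bad run until it reaches 0."""
    gap, rounds = path_length, 0
    while gap > 0:
        gap = longest_bad_run(gap, rng=rng)
        rounds += 1
    return rounds

rng = random.Random(1)
trials = [rounds_until_closed(1000, rng) for _ in range(100)]
print(sum(trials) / len(trials))  # average number of rounds in this toy model
```

In this toy model the gap collapses within a handful of rounds even for a path of length 1000, which is the behaviour the $O(\log^* \ell)$ intuition predicts.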
2.3 UPDATE

The UPDATE algorithm is described formally as
Algorithm 4. This algorithm has each node previously
involved in SEND broadcast all messages
it has received to its own quorum and to
the neighboring quorums. A pair of nodes x and
y is declared to be in conflict if: 1) x was scheduled
to send a message to y at some point in this
call to SEND; and 2) the message that x reported
that it received is different from the message that
y reported that it received. UPDATE then finds
at least one pair of nodes that are in conflict.

Moreover, in the case where a corruption occurred
during the call to SEND-LEADER, UPDATE
will identify a pair of neighboring quorums
$Q_j$ and $Q_{j+1}$, for some $1 \le j < \ell$, such
that one of the two quorums currently has a
bad leader. Both leaders from $Q_j$ and $Q_{j+1}$
are thrown out and new leaders are elected. If
these new leaders are not connected, then UPDATE
keeps electing new leaders for these two
quorums until two leaders are elected that still
have an edge between them.³ The properties of
UPDATE are given in Lemma 3.

Since each quorum has a 3/4 fraction of good
nodes, if we throw out both leaders of $Q_j$
and $Q_{j+1}$, and perform new elections, we make
progress. In particular, the expected number of
quorums that have good leaders will increase by
a positive amount. Intuitively, we would expect
that after this process repeats enough times, all
quorums will have good leaders. This intuition
is formalized in Lemma 4.

UPDATE makes use of a leader election proto- 
col to 1) enable a quorum of nodes to agree on a 
leader; and 2) ensure that the leader agreed on is 
good with probability at least 3/4. We now de- 
scribe how a quorum can elect a leader by using 
secure multiparty computation (SMPC) [13]. 

Let $n'$ be the number of nodes in the quorum
and let each node in the quorum be assigned a
unique integer from 1 to $n'$. First, each node
in the quorum chooses an input: an integer
uniformly distributed between 1 and $n'$. Then, the
nodes perform SMPC to find the output: the
sum of all their inputs modulo $n'$. The node in
the quorum associated with this output number
becomes the new leader of the quorum.

The leader selected will be uniformly distributed
provided that at least a 3/4 fraction of
the nodes in the quorum are good. Finally, this
leader election protocol runs in $O(1)$ time, and
requires $O(\log^2 n)$ total messages.

³ No edge is ever removed between a pair of good leaders,
and at least a 3/4 fraction of the nodes in each quorum
are good. Thus, for each election, there is probability
at least 9/16 of electing two good leaders. Hence, in
expectation, we require only a constant number of elections
before two connected leaders will be elected.
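The arithmetic of the election step can be sketched as follows. This Python fragment does not implement SMPC; it only computes the output function on inputs that are assumed to have been committed independently. Mapping a residue of 0 to node $n'$ is an assumption of the sketch (so that the result stays in the range 1 to $n'$), and the name `elect_leader` is hypothetical.

```python
import random

def elect_leader(inputs):
    """Map the quorum members' inputs to a leader number in 1..n'.

    inputs[i] is the integer in 1..n' chosen by the node numbered i + 1.
    The sum is reduced modulo n'; a residue of 0 is mapped to node n'.
    """
    n_prime = len(inputs)
    residue = sum(inputs) % n_prime
    return residue if residue != 0 else n_prime

# Two nodes always submit the same fixed input; the rest choose uniformly.
rng = random.Random(0)
n_prime = 8
counts = [0] * (n_prime + 1)
for _ in range(8000):
    fixed = [3, 3]
    uniform = [rng.randint(1, n_prime) for _ in range(n_prime - 2)]
    counts[elect_leader(fixed + uniform)] += 1
print(counts[1:])  # counts per node; each should be close to 1000
```

The experiment illustrates why the output is uniform: as long as at least one input is uniform and independent of the others, the sum modulo $n'$ is uniform regardless of what the remaining nodes submit.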

2.4 Some Details 

During the course of our algorithms, edges will 
be removed from the network. We assume that 
subsequent to the removal of an edge between 
node p and node q, no message is ever sent, or 
expected to be sent, from p to q or vice versa. 

We note that in the algorithm SEND-LEADER,
the node s sends messages to all nodes
in $Q_1$. This additional communication ensures
that the nodes in $Q_1$ all have received a message
from s. This ensures that in the case where s
is a bad node, it cannot cause two good leaders
to be in conflict during a call to UPDATE. The
same property holds true for the nodes in $Q_\ell$ and
the node r. This additional communication adds
$O(\log n)$ to the message cost of SEND-LEADER.

3 Analysis 

In this section, we prove our main result,
Theorem 1.1. We first require several lemmas.
Throughout this section, we let $n_q$ represent the
number of quorums in the quorum graph, and
let all logarithms be base 2.

The proof of the following lemma is deferred 
to the appendix due to space constraints. 

Lemma 1. Consider a sequence of $x$ nodes,
where each node in the sequence is bad independently
with probability 1/4. Then the probability
that this sequence contains a substring of
$\max(1, \log x)$ consecutive bad nodes is no
more than 1/2.
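The probability in Lemma 1 can be computed exactly for small cases with a short dynamic program. The Python sketch below is illustrative only and is not a substitute for the proof; rounding $\log x$ up to an integer run length is an assumption of ours, and the function name `prob_bad_run` is hypothetical.

```python
import math

def prob_bad_run(x, k, p_bad=0.25):
    """Exact probability that x independent picks contain k consecutive bad ones."""
    # state[r] = probability of currently ending in exactly r bad picks,
    # with no run of length k seen so far (r ranges over 0..k-1).
    state = [1.0] + [0.0] * (k - 1)
    for _ in range(x):
        nxt = [0.0] * k
        nxt[0] = sum(state) * (1 - p_bad)      # a good pick resets the run
        for r in range(k - 1):
            nxt[r + 1] = state[r] * p_bad      # a bad pick extends the run
        state = nxt
    return 1.0 - sum(state)

for x in (2, 16, 256, 4096):
    k = max(1, math.ceil(math.log2(x)))
    print(x, k, prob_bad_run(x, k))
```

For example, with $x = 2$ the run length is 1 and the probability is $1 - (3/4)^2 = 0.4375$, below the bound of 1/2.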

The next lemma shows that the algorithm 
CHECK catches corruptions with probability at 
least 1/2. 

Lemma 2. Assume some bad leader has corrupted
a message in the last call to SEND-LEADER.
Then when the algorithm CHECK is
called, with probability at least 1/2, some node
will call UPDATE.

Proof. This proof makes use of the following
two facts.

Fact 1. Assume that in round $i$, UPDATE will
be called if a good node chosen in round $i$ or
less at some quorum, $Q_j$, reliably transmits its
message to a good node chosen in round $i$ or less
at some quorum, $Q_k$, where $1 \le j < k \le \ell$.
Then in round $i + 1$, there exist $j'$ and $k'$, where
$j \le j' \le k' \le k$,⁴ such that:

• Property 1: UPDATE is called if a node chosen
in round $i + 1$ or less at $Q_{j'}$ transmits its
message reliably to a node chosen in round
$i + 1$ or less at $Q_{k'}$.

To prove this fact, note that the good nodes
in $Q_j$ that are chosen by CHECK in rounds $i$ or
less know s's public key, $k_p$. Thus they must
receive uncorrupted messages signed by s's private
key, $k_s$, in all rounds subsequent to $i$, or else
UPDATE will be triggered.

Fact 2. Fix a round $i + 1$ of CHECK and let $j'$
and $k'$ be indices satisfying Property 1 of Fact 1
that minimize the value $k' - j'$. Then in round
$i + 1$, UPDATE is called unless all nodes that are
chosen between quorums $Q_{j'}$ and $Q_{k'}$ are bad.

To show Fact 2, assume by way of contradiction
that it is false. Then there exists $x'$, such
that $j' < x' < k'$, where the node $p'$ chosen in
round $i + 1$ at $Q_{x'}$ is good. There are two cases
for what happens in round $i + 1$:

• Case 1: The node $p'$ receives the message
$m'$ sent by s. But then the indices $j'$ and $x'$
satisfy Property 1 of Fact 1 and $x' - j' < k' - j'$.
This contradicts the assumption that the
indices $j'$ and $k'$ had the minimal distance
among all indices satisfying Property 1 in
Fact 1.

• Case 2: The node $p'$ does not receive the
message $m'$ sent by s. But then the indices
$x'$ and $k'$ satisfy Property 1 of Fact 1 and
$k' - x' < k' - j'$. This again contradicts
the assumption that the indices $j'$ and $k'$
had the minimal distance among all indices
satisfying Property 1 in Fact 1.

Now we can use these two facts to prove the
lemma. Let $X_i$ be an indicator random variable
that is equal to 1 if $(k' - j') < \log(k - j)$ and
0 otherwise. By Lemma 1, each $X_i$ is 1 with
probability at least 1/2. We require at least $\log^* n$ of
the $X_i$ random variables to be 1 in order for some
node to call UPDATE.⁵ Let $X = \sum_{i=1}^{4\log^* n} X_i$.
Then $E(X) = 2\log^* n$, and since the $X_i$'s are
independent, by Chernoff bounds,

$$\Pr\left(X < (1 - \delta)\, 2\log^* n\right) \le \left(\frac{e^{\delta}}{(1 + \delta)^{1 + \delta}}\right)^{2\log^* n}.$$

When $1 - \delta = 1/2$, $\delta = 1/2$. For $n > 16$,

$$\Pr\left(X < \log^* n\right) < \left(\frac{e^{1/2}}{(3/2)^{3/2}}\right)^{2\log^* n} < \frac{1}{2}.$$

Thus the probability that CHECK succeeds in
finding a corruption and calling UPDATE is at
least 1/2. □
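The role of the condition $n > 16$ can be checked numerically. The Python sketch below assumes the bound has the form $\left(e^{\delta}/(1+\delta)^{1+\delta}\right)^{2\log^* n}$ with $\delta = 1/2$ and base-2 logarithms; that form is our reading of the displayed inequality, and the helper names are hypothetical.

```python
import math

def log_star(n):
    """Number of times log2 must be applied to n before the result is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

def tail_bound(n, delta=0.5):
    """Evaluate (e^delta / (1 + delta)^(1 + delta))^(2 log* n)."""
    base = math.exp(delta) / (1 + delta) ** (1 + delta)
    return base ** (2 * log_star(n))

for n in (16, 17, 2**16, 2**16 + 1):
    print(n, log_star(n), tail_bound(n))
```

Under these assumptions the bound is above 1/2 at $n = 16$, where $\log^* n = 3$, and drops below 1/2 from $n = 17$ onward, where $\log^* n \ge 4$.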

We say that a node q that is a leader of a quo- 
rum Q is deposed if the quorum Q elects a new 
leader uniformly at random with replacement. 
The following lemma shows that if a corruption 
is caught during a call to SEND-LEADER, UP- 
DATE deposes at least one pair of leaders, and 
each pair contains at least one bad node. Also, 
UPDATE always removes at least one edge when 
it is called, and at least one endpoint of each re- 
moved edge is a bad node. The proof is deferred 
to the appendix. 

Lemma 3. If some bad leader has corrupted a
message in the last call to SEND-LEADER, then
the algorithm UPDATE will 1) identify a pair
of neighboring quorums $Q_j$ and $Q_{j+1}$, for some
$1 \le j < \ell$, such that at least one of the two
quorums currently has a bad leader; and 2) elect
new leaders for these two quorums. Moreover,
UPDATE will always remove at least one edge
from the network, and at least one endpoint of
each edge removed will be a bad node.

⁴ Note that $j' = k'$ corresponds to the case where a
conflict is detected.

⁵ Here we assume $\ell < n$. However, we can achieve the
same asymptotic results assuming that $\ell$ is bounded by a
polynomial in $n$.

The next lemma bounds the expected num- 
ber of times that a pair of neighboring leaders is 
deposed. The proof is in the appendix. 

Lemma 4. Assume there are j quorums in the 
quorum graph that have bad leaders, for any pos- 
itive integer j. Then, in expectation, the number 
of corruptions that must be caught by CHECK 
before the leaders of all quorums are good is no 
more than 2j. 

We can now prove our main theorem. 

Proof of Theorem 1.1. We start with resource
costs. By Lemma 3, each time UPDATE is
called, at least one edge is removed from the network.
Hence, the resource costs of all calls to
UPDATE are bounded as the number of calls to
SEND grows large. Thus, for the amortized cost,
we consider only the cost from calls to CHECK
and SEND-LEADER. When sending through a
path of $\ell$ quorums, SEND-LEADER has latency
$O(\ell)$ and message cost $O(\ell + \log n)$. CHECK has
latency and message cost $O(\ell(\log^* n)^2)$, but it is
called only with probability $1/(\log^* n)^2$. Hence
the amortized expected cost of SEND is $O(\ell)$
latency and $O(\ell + \log n)$ messages.
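Written out, the expected message cost of one call to SEND combines the two terms as follows (the labels under the braces are ours):

```latex
\begin{align*}
\mathbb{E}[\text{messages per SEND}]
  &= \underbrace{O(\ell + \log n)}_{\text{SEND-LEADER}}
   + \frac{1}{(\log^* n)^2} \cdot
     \underbrace{O\!\left(\ell (\log^* n)^2\right)}_{\text{CHECK}} \\
  &= O(\ell + \log n) + O(\ell) \\
  &= O(\ell + \log n).
\end{align*}
```

The same calculation with $O(\ell)$ in place of $O(\ell + \log n)$ gives the expected latency of $O(\ell)$.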

More specifically, if we perform any number
of message sends through quorum paths, where
$\ell_M$ is the longest such path, and $\mathcal{L}$ is the sum
of the quorums traversed in all such paths, then
the expected total number of messages sent will
be $O(\mathcal{L} + t \cdot \ell_M \log^2 n \log^* n)$. This is true since
each call to UPDATE costs $O(\ell_M \log^2 n \log^* n)$
messages, since we perform $O(\log^* n)$ Byzantine
agreements over at most $\ell_M$ quorums. Note that,
since $t$ is fixed, for large $\mathcal{L}$ this value is $O(\mathcal{L})$.

We now bound the expected number of corruptions.
Let $X$ be a random variable giving
the number of quorums which initially have bad
leaders. For $1 \le i \le n_q$, let $p_i$ be the probability
that the $i$-th quorum has a bad leader. By
linearity of expectation, $E(X) = \sum_{i=1}^{n_q} p_i$. Note
that for $1 \le i \le n_q$, $p_i$ equals the number of bad
nodes in the $i$-th quorum divided by the size of
the $i$-th quorum. Thus, the denominator for each
$p_i$ is $\Theta(\log n)$ and the sum of all the numerators
is $O(t \log n)$, since each node is in $O(\log n)$
quorums. This implies that $E(X) = O(t)$.
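The last step can be spelled out as a single chain. Here $b_i$ denotes the number of bad nodes in the $i$-th quorum $Q_i$; this symbol is introduced only for this illustration.

```latex
\[
E(X) \;=\; \sum_{i=1}^{n_q} p_i
     \;=\; \sum_{i=1}^{n_q} \frac{b_i}{|Q_i|}
     \;\le\; \frac{\sum_{i=1}^{n_q} b_i}{\min_i |Q_i|}
     \;=\; \frac{O(t \log n)}{\Theta(\log n)}
     \;=\; O(t).
\]
```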

Now by Lemma 4, the expected number of
times CHECK must catch a corruption before
the leaders of all quorums are good is no more
than $2X$. Hence, by linearity of expectation,
the expected number of corruptions that must
be caught is $2E(X) = O(t)$. Finally, if a bad
node caused a corruption during a call to
SEND-LEADER and CHECK is called, then, by
Lemmas 2 and 3, with probability at least 1/2,
CHECK will catch it. As a consequence, it will
call UPDATE, which will elect two new leaders.
Since CHECK is called with probability
$1/(\log^* n)^2$, each corruption triggers UPDATE with
probability at least $1/(2(\log^* n)^2)$, so the expected
total number of corruptions is $O(t(\log^* n)^2)$. □

4 Empirical Results 
4.1 Setup 

In this section, we empirically compare the mes- 
sage costs and the corrupted message counts 
of two algorithms via simulation. The first al- 
gorithm we simulate is the Butterfly algorithm 
from [3] . This algorithm has no self-healing prop- 
erties, and simply uses all-to-all communication 
between quorums that are connected in a but- 
terfly network. The second algorithm is Loglog, 
wherein we apply a modified version of our self- 
healing algorithm to the butterfly network. 

For the Loglog algorithm, we modify CHECK
so that it requires fewer messages for practical
values of $n$. Instead of requiring $O(\ell(\log^* n)^2)$
messages per check, we modify it to require
$O(\ell(\log \log n)^2)$ messages. When the nodes are
picked to participate in subquorums, we replace
incrementally adding one node to the $S_i$ sets
over each of $4 \log^* n$ rounds, with directly adding
$\log \log n$ nodes in only one round. Effectively,
each $\log \log n$-sized subquorum $S_i$ engages in
all-to-all communication with its neighboring
subquorum $S_{i+1}$.

We can show that our modified CHECK fails with
$o(1)$ probability. This new CHECK succeeds if
every subquorum has at least one good node.
The probability of any subquorum having only
bad nodes is at most $(1/4)^{\log \log n} = 1/\log^2 n$.
Union-bounding over all $\ell$ subquorums, the probability
of our modified CHECK failing is at most
$\ell/\log^2 n$. For the Butterfly topology, we have
$\ell = O(\log n)$, so the probability of our modified
CHECK failing is $o(1)$.
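For reference, the failure bound follows from a short calculation with base-2 logarithms, assuming each node of a subquorum is bad independently with probability at most 1/4:

```latex
\begin{align*}
\left(\tfrac{1}{4}\right)^{\log\log n}
  &= 2^{-2\log\log n} = \frac{1}{\log^2 n}, \\
\Pr[\text{modified CHECK fails}]
  &\le \ell \cdot \frac{1}{\log^2 n}
   = \frac{O(\log n)}{\log^2 n}
   = O\!\left(\frac{1}{\log n}\right) = o(1).
\end{align*}
```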

Our simulations consist of a sequence of 
queries over the network, consisting of a pair 
of nodes s, r, chosen uniformly at random, such 
that s sends a message to r. We simulate an 
adversary who chooses at the beginning of each 
simulation a fixed number of nodes to control 
uniformly at random without replacement. Our 
adversary attempts to corrupt messages between 
nodes whenever possible. Aside from attempting 
to corrupt messages, the adversary performs no 
other attacks to attempt to deny service. 

4.2 Results 

The results of our experiments are shown in
Figures 2, 3, and 4. These results highlight
two strengths of our self-healing algorithms
(Loglog) when compared to algorithms without
self-healing (Butterfly). First, the cost of a
query decreases as the total number of queries
increases, as illustrated in Figure 2. Second, for
a fixed number of queries, the cost of a query
decreases as the total number of bad nodes
decreases, as illustrated in Figure 3. In particular,
when there are no bad nodes, Loglog has dramatically
less cost than Butterfly.

We now describe our results in more detail.
Figure 2 shows the number of messages per query
versus the number of queries for Butterfly and
Loglog when the fraction of bad nodes is 1/8. The
left plot is for a network of size n = 1,329, and
the right plot is for a network of size n = 14,116.
The two curves intersect when the total number
of queries is 5,909 and 98,168 respectively. These
intersection points represent an average of 4.4
queries per node for the left plot, and 7.0 queries
per node for the right plot.
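The queries-per-node figures follow directly from the reported crossover points, as the short Python check below shows. The helper name `queries_per_node` is ours.

```python
def queries_per_node(total_queries, n):
    """Average number of queries per node at a given crossover point."""
    return total_queries / n

# Crossover points reported for the two network sizes.
for n, total in ((1329, 5909), (14116, 98168)):
    print(n, round(queries_per_node(total, n), 1))  # 4.4 and 7.0
```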

Figure [3] shows the number of messages per 
query versus the number of bad nodes for both 
Butterfly and Loglog. In the left plot, the net- 
work size n = 1,329 and the number of queries 
is fixed at 10,000. In the right plot the network 
size is n = 14,116 and the number of queries is 
fixed at 100,000. The two curves intersect when 
the fraction of bad nodes is .182 in the left plot, 
and .126 in the right plot. 



Figure [4] shows the number of corruptions ver- 
sus the number of queries for Loglog, when the 
fraction of bad nodes is 1/8. The left plot is 
for a network of size n = 1,329, and the right 
plot is for a network of size n = 14,116. The 
number of corruptions flattens out at about 7.4 
total corruptions for the left plot, and 408 total 
corruptions for the right plot. 

5 Conclusion and Future Work 

We have presented algorithms that can signif- 
icantly reduce communication cost in attack- 
resistant peer-to-peer networks. The price we 
pay for this improvement is the possibility of 
message corruption. In particular, if there are
$t < n/4$ bad nodes in the network, our algorithm
allows $O((\log^* n)^2 t)$ message transmissions to be
corrupted in expectation, before the bad nodes
are quarantined so they cause no more corruptions.
We have simulated variants of our algorithms
and demonstrated that they perform well
in practice, particularly as the number of queries
grows large.

Many problems remain open. First, it seems
unlikely that the smallest number of corruptions
allowable by an attack-resistant algorithm with
optimal message complexity is $O((\log^* n)^2 t)$.
Can we improve this upper bound to $O(t)$ or
else prove a non-trivial lower bound? Second,
can we apply techniques in this paper to problems
more general than enabling secure communication?
For example, can we create self-healing
algorithms for distributed computation
with Byzantine faults? Finally, can we optimize
constants and make use of heuristic techniques
in order to significantly improve our algorithms'
empirical performance?
Figure 4: Number of corruptions versus number of queries



Algorithm 3 CHECK (m, r) 



Initialization: Node s generates a public/private key pair
$k_p$, $k_s$ to be used throughout this procedure. Throughout the
procedure, if a node has previously received $k_p$, then it verifies
each subsequent message with it. If a node receives inconsistent
messages or fails to receive and verify an expected message, then it
initiates a call to UPDATE. We again let $Q_1, Q_2, \ldots, Q_\ell$
be the quorum path from s to r in the quorum graph. Finally, each
$S_j$ is empty initially, for $1 \le j \le \ell$.

for $i \leftarrow 1, \ldots, 4 \log^* n$ do

1. s sets R to be an $\ell$ by $C \log n$ array of random numbers,
where $C \log n$ is the maximum size of any quorum. For any k
between 1 and $C \log n$, $R_{j,k}$ is a uniformly random
number between 1 and k that will be used by quorum $Q_j$.

2. s chooses a node $x_1$ uniformly at random from all
neighbors in $Q_1$, and adds this node to $S_1$.

3. s sets $m'$ to be the message, signed by $k_s$, consisting
of m, $k_p$, r, and R.

4. s sends $m'$ to all i nodes in $S_1$ and also to the leader
of $Q_1$.

5. For $j \leftarrow 1, \ldots, \ell - 1$ do

(a) $S_j \leftarrow S_j$ plus the leader of $Q_j$.

(b) The nodes in $S_j$ set $x_{j+1}$ to be a node selected
uniformly at random from the nodes in $Q_{j+1}$. To do
this, they let x be the number of nodes in $Q_{j+1}$ and
select the $R_{j,x}$-th node in $Q_{j+1}$ (using a
canonical ordering of the nodes based on their IDs).

(c) The nodes in $S_j$ set their new value of $S_{j+1}$ to
be equal to their old value union the node $x_{j+1}$.

(d) The nodes in $S_j$ send to $x_{j+1}$ both $m'$ and the
IDs of all nodes in $S_{j+1}$.

(e) The node $x_{j+1}$ sends $m'$ to all the nodes in
$S_{j+1}$.

6. The nodes in $S_\ell$ send $m'$ to the node r.
end for
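Step 5(b) relies on every node in $S_j$ deriving the same "random" choice from the shared array R. The following is a minimal sketch of that selection rule, not the authors' implementation: the helper names `make_random_array` and `select_next_node`, the 0-based quorum index, and the use of integer node IDs are all assumptions made for illustration.

```python
import random

def make_random_array(path_len, max_quorum_size, rng):
    """Build R so that R[j][k] is uniform in 1..k (k is 1-based).
    Index 0 of each row is an unused placeholder."""
    return [
        [None] + [rng.randint(1, k) for k in range(1, max_quorum_size + 1)]
        for _ in range(path_len)
    ]

def select_next_node(R, j, next_quorum_ids):
    """Pick the R[j][x]-th node of the next quorum, where x is the quorum
    size and nodes are ordered canonically by ID."""
    x = len(next_quorum_ids)
    ordered = sorted(next_quorum_ids)
    return ordered[R[j][x] - 1]

if __name__ == "__main__":
    rng = random.Random(0)
    R = make_random_array(path_len=3, max_quorum_size=8, rng=rng)
    print(select_next_node(R, 1, [42, 7, 19]))
```

Because the choice depends only on R, the quorum size, and the sorted IDs, two nodes that hold the same R and the same membership list agree on $x_{j+1}$ without exchanging further messages.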



Algorithm 4 UPDATE 

Assumptions: All broadcasts are done via 
Byzantine agreement 

1. The node, x, making the call to UPDATE broadcasts this fact to
its quorum, $Q'$, along with all the messages that x has received
during this call to SEND. The nodes in $Q'$ check that x received
inconsistent messages before proceeding.

2. The quorum $Q'$ propagates the fact that a call to UPDATE is
occurring, via all-to-all communication, to all quorums
$Q_1, Q_2, \ldots, Q_\ell$.

3. s broadcasts all messages it sent in this call to SEND to the
nodes in $Q_1$, and these messages are sent via all-to-all
communication to all remaining quorums $Q_2, Q_3, \ldots, Q_\ell$.

4. Each node involved in this call to SEND compiles all messages it
has received (and from whom) in this call to SEND, and broadcasts
these messages to the nodes in its own quorum and in all neighboring
quorums in the quorum path. The node s broadcasts the messages to
the nodes in $Q_1$, and the node r broadcasts the messages to the
nodes in $Q_\ell$.

5. A pair of nodes x and y is declared to be in 
conflict if: 1) x was scheduled to send a mes- 
sage to y at some point in this call to SEND; 
and 2) the message that x reported that it 
received is different than the message that y 
reported that it received. For every pair of 
nodes x, y that are in conflict the edge be- 
tween x and y is removed. Specifically, the 
edge (x, y) is removed from the edge list of 
all nodes in the quorums of x and y. 

6. A pair of leaders, $(q_j, q_{j+1})$, is deposed if they are in
conflict and $q_j$ is not in conflict with $q_{j-1}$. The quorums
$Q_j$ and $Q_{j+1}$ then both hold elections for two new leaders. If
the newly elected leaders have no edge between them, then we
repeatedly elect two new leaders until the two leaders are
connected.
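The conflict rule in step 5 can be sketched as follows. This is a simplified illustration only: it assumes each node reports a single received message per call (the algorithm tracks reports per round), and the names `find_conflicts` and `remove_conflict_edges` are hypothetical.

```python
def find_conflicts(schedule, reported):
    """Return the scheduled (x, y) pairs whose reported messages differ.

    schedule: iterable of (sender, receiver) pairs for this call to SEND.
    reported: dict mapping each node to the message it reports receiving.
    """
    return {(x, y) for (x, y) in schedule if reported.get(x) != reported.get(y)}

def remove_conflict_edges(edges, conflicts):
    """Drop every undirected edge whose endpoints are in conflict."""
    banned = {frozenset(pair) for pair in conflicts}
    return {edge for edge in edges if frozenset(edge) not in banned}

if __name__ == "__main__":
    schedule = [("s", "a"), ("a", "b"), ("b", "r")]
    reported = {"s": "m", "a": "m", "b": "m*", "r": "m*"}
    conflicts = find_conflicts(schedule, reported)
    edges = {frozenset(pair) for pair in schedule}
    print(conflicts, remove_conflict_edges(edges, conflicts))
```

In this toy run the message changes between a and b, so only the edge (a, b) is removed. Consistent with the first paragraph of the proof of Lemma 3, at least one of a and b must be bad.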
A Appendix - Deferred Proofs

In this appendix, we provide proofs that were 
deferred from the main paper. 
Lemma 1. Consider a sequence of x nodes, where each node in the
sequence is bad independently with probability 1/4. Then the
probability that this sequence contains any substring of
$\max(1, \log x)$ consecutive bad nodes is no more than 1/2.

Proof. The probability of a specific substring of $\log x$ nodes
being all bad is

$$\left(\frac{1}{4}\right)^{\log x} = \frac{1}{x^2}.$$

Union bounding over all possible substrings of length $\log x$, of
which there are at most x, the probability of any all-bad substring
existing is at most

$$x \cdot \frac{1}{x^2} = \frac{1}{x} \le \frac{1}{2}$$

for $x \ge 2$. For $x = 1$, $\max(1, \log x) = 1$, and the
probability of having a subsequence of 1 bad node in a sequence of
1 node is 1/4. □
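As a sanity check on Lemma 1, the probability of an all-bad run can be computed exactly for small x with a short dynamic program. This is a sketch under stated assumptions: logarithms are taken base 2, x is restricted to powers of two so that $\log x$ is an integer, and the function name `prob_bad_run` is illustrative.

```python
from fractions import Fraction
from math import log2

def prob_bad_run(x, run_len, p_bad=Fraction(1, 4)):
    """Exact probability that x independent nodes, each bad with
    probability p_bad, contain run_len consecutive bad nodes."""
    # state[r]: probability that no full bad run has occurred so far and
    # the sequence currently ends in exactly r consecutive bad nodes.
    state = [Fraction(0)] * run_len
    state[0] = Fraction(1)
    for _ in range(x):
        nxt = [Fraction(0)] * run_len
        for r, pr in enumerate(state):
            nxt[0] += pr * (1 - p_bad)       # next node is good
            if r + 1 < run_len:
                nxt[r + 1] += pr * p_bad     # next node is bad, run not yet full
        state = nxt
    return 1 - sum(state)

if __name__ == "__main__":
    for x in (1, 2, 4, 8, 16, 32, 64):
        run_len = max(1, int(log2(x)))
        print(x, run_len, prob_bad_run(x, run_len))
```

For every x in this range the exact value stays at or below 1/2, as the lemma states.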
Lemma 3. If some bad leader has corrupted a message in the last
call to SEND-LEADER, then the algorithm UPDATE will 1) identify a
pair of neighboring quorums $Q_j$ and $Q_{j+1}$, for some
$1 \le j < \ell$, such that at least one of the two quorums
currently has a bad leader; and 2) elect new leaders for these two
quorums. Moreover, UPDATE will always remove at least one edge from
the network, and at least one endpoint of each edge removed will be
a bad node.
Proof. First, we show that if a pair of nodes x 
and y is in conflict, then at least one of them 
is bad. Assume not. Then both x and y are 
good. Then node x would have truthfully re- 
ported what it received; any message that x re- 
ceived would have been sent directly to y; and y 
would have truthfully reported what it received 
from x. But this is a contradiction, since for x 
and y to be in conflict, y must have reported that 
it received from x something different than what 
x reported receiving. 

Now consider the case where a bad leader corrupted a message in the
last call to SEND-LEADER. By the definition of corruption, there
must be two good leaders $q_j$ and $q_k$ such that $j < k$ and $q_j$
received the message $m'$ sent by node s, and $q_k$ did not. We now
show that some pair of leaders between $q_j$ and $q_k$ will be in
conflict. Assume this is not the case. Then for all x, where
$j \le x \le k - 1$, leaders $q_x$ and $q_{x+1}$ are not in
conflict. But then, since leader $q_j$ received the message $m'$,
and there are no pairs of leaders in conflict, it must be the case
that the leader $q_k$ received the message $m'$. This is a
contradiction. Thus, UPDATE will find two leaders that are in
conflict, and at least one of them will be bad.

Now we prove that at least one pair of nodes is found to be in
conflict as a result of triggering UPDATE. Assume that no pair of
nodes is in conflict. Then for every pair of nodes x and y, such
that x was scheduled to send a message to y during any round i of
CHECK, x and y must have reported that they received the same
message in round i. In particular, this implies via induction that
for every round i, for all j, where $1 \le j \le \ell$, all nodes in
the sets $S_j$ must have broadcast that they received the message
$m'$ that was initially sent by node s in round i. But if this is
the case, the node x that initially called UPDATE would have
received no inconsistent messages. This is a contradiction since in
such a case, node x would have been unsuccessful in trying to
initiate a call to UPDATE. Thus, some pair of nodes must be found to
be in conflict, and at least one of them is bad. □

Note that the above proof shows that, even if 
the node s is bad, a call to UPDATE will remove 
an edge between two nodes, at least one of which 
is bad. Thus, a bad s can only force calls to 
UPDATE a fixed number of times. 

Lemma 4. Assume there are j quorums in the quorum graph that have
bad leaders, for any positive integer j. Then, in expectation, the
number of corruptions that must be caught by CHECK before the
leaders of all quorums are good is no more than 2j.

Proof. By Lemma 3, if a corruption occurred in SEND-LEADER, it is
caught by CHECK, and UPDATE is called, then two neighboring quorums,
$Q_j, Q_{j+1}$, for some $1 \le j < \ell$, will be identified such
that at least one leader of these quorums is bad. Then, the current
leaders of these quorums will be deposed and new leaders will be
elected for both quorums.

We will model the properties of our quorum graph with a walk on a
Markov chain with states $0, 1, \ldots, n_q$ (see footnote 6). The
walk will be at state i if exactly i quorums in the quorum graph
have bad leaders. Note that state 0 is an absorbing state: if no
quorums have bad leaders, there will no longer be any corruptions,
and so there is no possibility of any good leaders being deposed.

When two new leaders are elected, let $P_{bb}$ be equal to the
probability that two bad leaders are elected; $P_{bg}$ be equal to
the probability that one bad and one good leader are elected; and
$P_{gg}$ be equal to the probability that two good leaders are
elected. Note that $P_{bb} \le 1/16$, and $P_{gg} \ge 9/16$.
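As a worked check of these bounds, suppose (purely for illustration) that each newly elected leader is bad independently with probability exactly $p = 1/4$, the worst case consistent with the bounds above. Then the three probabilities are:

```latex
\begin{align*}
P_{bb} &= p^2 = \tfrac{1}{16}, \\
P_{bg} &= 2p(1-p) = \tfrac{6}{16}, \\
P_{gg} &= (1-p)^2 = \tfrac{9}{16},
\end{align*}
```

and they sum to 1. The independence of the two elections is an assumption of this example, not a claim made in the text.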

Transitions occur whenever a corruption is 
caught by CHECK and UPDATE is called. If 
this happens, two leaders are deposed (one of 
which is guaranteed to be bad); and two new 
leaders are elected. We want to bound the ex- 
pected number of corruptions that are detected 
until we reach state 0, which is equivalent to 
bounding the expected number of steps in the 
walk until we reach state 0. 

We now give transition probabilities on the Markov chain to upper
bound the expected number of corruptions until all leaders are good.
When in any state i, $0 < i < n_q$, the walk transitions to state
$i - 1$ with probability 9/16; the walk transitions to state $i + 1$
with probability 1/16; and the walk stays in state i with
probability 6/16. If the walk is in state 0, it will stay there with
probability 1.

Let f(i) be the expected number of steps on 
this Markov chain to reach state 0, given that 
we are currently in state i. Note that f(i) is an 
upper bound on the expected number of pairs 
of leaders that must be deposed before all lead- 
ers are good, given that there are i bad leaders 
currently. 

6 Recall that $n_q = \Theta(n)$ is the number of quorums in the
quorum graph.

Solving for f is a simple variant of the "gambler's ruin" problem.
We include it here for completeness. We have the following equations
for f: $f(0) = 0$; and for all i, $1 \le i < n_q$,

$$f(i) = 1 + \frac{9}{16} f(i-1) + \frac{1}{16} f(i+1) + \frac{6}{16} f(i).$$

Rewriting this equation, we have for all i, $1 \le i < n_q$,

$$10 f(i) = 16 + 9 f(i-1) + f(i+1).$$

For any j, where $1 \le j \le n_q$, if we sum the above equations
over all i, $1 \le i \le j - 1$, we obtain:

$$\begin{aligned}
f(j) &= 9 f(j-1) + f(1) - 16(j-1) - 9 f(0) \\
     &= 9 f(j-1) + f(1) - 16(j-1).
\end{aligned}$$

We now prove that for all j, where $0 \le j \le n_q$, $f(j) \le 2j$.
The base case, $f(0) = 0$, is trivially true. For the inductive
step, we have

$$\begin{aligned}
f(j) &= 9 f(j-1) + f(1) - 16(j-1) \\
     &\le 18(j-1) + 2 - 16(j-1) \\
     &= 2j. \qquad \Box
\end{aligned}$$
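The bound can be checked numerically. The sketch below is illustrative only: it first verifies that $f(i) = 2i$ satisfies the rewritten recurrence exactly, then estimates the expected number of steps to reach state 0 by simulation. It assumes the walk is not capped at $n_q$ (reasonable when the start state is far below $n_q$); the function names and the fixed seed are choices made for this example.

```python
import random

def satisfies_recurrence(f, max_state):
    """Check 10 f(i) = 16 + 9 f(i-1) + f(i+1) for 1 <= i < max_state."""
    return all(
        10 * f(i) == 16 + 9 * f(i - 1) + f(i + 1)
        for i in range(1, max_state)
    )

def simulate_steps(start, rng):
    """Walk from `start` until state 0 is reached; return the step count."""
    state, steps = start, 0
    while state > 0:
        u = rng.random()
        if u < 9 / 16:
            state -= 1          # net loss of one bad leader
        elif u < 10 / 16:
            state += 1          # net gain of one bad leader
        # otherwise (probability 6/16) the state is unchanged
        steps += 1
    return steps

def mean_steps(start, trials, seed=0):
    """Average number of steps to absorption over `trials` seeded runs."""
    rng = random.Random(seed)
    return sum(simulate_steps(start, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    print(satisfies_recurrence(lambda i: 2 * i, 50))
    print(mean_steps(start=5, trials=20000))
```

Starting from 5 bad leaders, the simulated mean should land close to 10, in line with $f(j) \le 2j$.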